Browsing Fatigue in Handhelds: Semantic Bookmarking 
Spells Relief 


Saikat Mukherjee 
Dept. of Computer Science 
Stony Brook University 
Stony Brook, NY, 11794, USA 


saikat@cs.sunysb.edu 


ABSTRACT 


Focused Web browsing activities such as periodically look- 
ing up headline news, weather reports, etc., which require 
only selective fragments of particular Web pages, can be 
made more efficient for users of limited-display-size hand- 
held mobile devices by delivering only the target fragments. 
Semantic bookmarks provide a robust conceptual framework 
for recording and retrieving such targeted content not only 
from the specific pages used in creating the bookmarks but 
also from any user-specified page with similar content se- 
mantics. This paper describes a technique for realizing se- 
mantic bookmarks by coupling machine learning with Web 
page segmentation to create a statistical model of the book- 
marked content. These models are used to identify and re- 
trieve the bookmarked content from Web pages that share a 
common content domain. In contrast to ontology-based ap- 
proaches where semantic bookmarks are limited to available 
concepts in the ontology, the learning-based approach allows 
users to bookmark ad-hoc personalized semantic concepts 
to effectively target content that fits the limited display of 
handhelds. User evaluation measuring the effectiveness of a 
prototype implementation of learning-based semantic book- 
marking at reducing browsing fatigue in handhelds is pro- 
vided. 

I.7.5 [Document and Text Processing]: Document Capture - Document Analysis;
H.3.3 [Information Systems]:

1. INTRODUCTION


Handheld mobile devices such as PDAs and cell phones,
with browsers and processors embedded in them, are becoming
popular as Web browsing gadgets “on-the-go”. However,
their limited display size forces users to scroll repeatedly,
using various buttons, to view the desired content. This makes
browsing with handhelds a tedious and fatigue-inducing task.
Hence, adapting Web content so as to make browsing with 
handhelds more efficient is an important problem that has 
been drawing serious research attention. 

Initial approaches to adapting Web content onto hand- 
helds [5, 21, 23] placed the burden on content providers to 
script Web pages specifically for such limited display devices. 
More recent techniques [7, 9, 12, 35] propose heuristics for 
adapting the content of the entire Web page into hierarchical 
structures summarizing the content. While they are quite 
effective for exploratory browsing, there are many scenarios 
where the user repeatedly needs targeted data from specific 
Web sites. Such periodic revisits usually signify the user’s 
interest in certain specific content in these pages — e.g. the 
user may periodically browse news portals to read breaking 
news. In such situations, adapting the content of the entire 
Web page will require the user to repeatedly and needlessly 
navigate the summary structure. On the other hand, delivering
focused content constituting only the desired fragment
of an entire page to handhelds obviates needless
scrolling, thereby reducing stress and fatigue.

Bookmarks provide the user with direct access to pages 
containing specific, highly targeted content of interest. Tra- 
ditionally, creating a bookmark amounts to saving the URL 
of the page while retrieval fetches the entire page. However, 
for adapting this operational aspect of bookmarks to hand- 
helds with limited display one has to focus exclusively on 
the target content. This requires associating with the book- 
mark both the URL of the page as well as extraction expres- 
sions that when applied to the page will retrieve the desired 
content. In fact, research in wrapper-based data extraction
techniques [24] has focused on building such expressions
using various syntactic cues surrounding the target content 
in a page. However, wrappers are learned per page and are 
also brittle to structural variations in the page. Thus, they 
are not only difficult to scale across pages but are also hard 
to maintain over time. 

We can overcome the above limitations using the notion of 
semantic bookmarks. A semantic bookmark associates con- 
tent segments in Web pages, even from different Web sites, 
with a “concept” from an application domain. Informally, a 
concept represents an abstract entity that is associated with 
some properties. For example, the news domain will consist 
of concepts such as Taxonomy news, Major Headline news, 

Figure 1: (a) New York Times front page (b) Los Angeles Times front page (c) Los Angeles Major Headlines
instance on a PocketPC Emulator


Category news, etc. As far as properties go, Major Head- 
line news items are characterized by a link labeled with the 
headline text, the news source, and a brief summary. An
occurrence of a concept in a page is called an instance of that concept.
In Figures 1(a) and (b) the rectangular portions on the left- 
most columns are instances of Taxonomy news while the el- 
liptical portions are Major Headline news instances and the 
rectangular portion on the rightmost column in Figure 1(a) 
is a Category news instance. For the end user, creating a 
semantic bookmark amounts to merely highlighting (some) 
concept instances in (a few) Web pages. Retrieval of a se- 
mantic bookmark, on the other hand, means not only ex- 
tracting the concept instances from the Web pages used to 
create it but also from any page in any other site (specified 
by the user) where the concept can occur. For example, if 
the user creates the semantic bookmark of Major Headline 
news from the front page of New York Times then it should 
be possible to retrieve headline news items from Los Ange- 
les Times front page also using this bookmark even though 
Los Angeles Times was not used for creating the bookmark. 
Observe that in contrast to a wrapper the scope of a seman- 
tic bookmark extends to all those pages across sites with 
similar content semantics, i.e. it is scalable. 

In this paper we explore the idea of realizing semantic 
bookmarks by judiciously combining machine learning with 
Web page segmentation. Broadly speaking the method is 
this: Organizing a Web page into its logical structure amounts 
to creating a tree of partitions each of which aggregates 
items in the page with similar content semantics. The user 
highlights a (small) set of example partitions, possibly from 
different partition trees, as instances of the concept to be 
bookmarked. From these labeled nodes a statistical model
of the features in the bookmarked content is learned. The
learned concept models are then applied to identify and retrieve
concept instances from any other partition tree that
shares a common content domain (e.g. news portals, travel
portals, etc.); the retrieved instances are delivered to the
handheld. Figure 1(c) shows the major headlines news fragment
of the Los Angeles Times front page on a PocketPC handheld.

An alternative implementation of semantic bookmarking 
is through ontologies, a computational vehicle for captur- 
ing machine processable knowledge about an application do- 
main. Specifically, this knowledge is represented explicitly 
in the ontology as domain concepts, their features, and relationships
among them. In this approach, the ontology will
identify the concept instances present in the page which can 
then be saved as semantic bookmarks. The idea of using on- 
tologies for implementing semantic bookmarking was men- 
tioned in [17] and [27] within the larger context of creating a 
semantic layer over Web pages and for assistive browsing re- 
spectively. However, content delivery to handhelds was not 
the focus of those works. The problem with ontology-based
approaches is that they limit semantic bookmarking to concepts
present a priori in the ontology. Since an ontology may
not necessarily be extensive, concepts that a user is inter- 
ested in capturing may not be present in the ontology. Our 
learning-based approach presents a more flexible paradigm 
where ad-hoc personalized semantic concepts can be defined 
and bookmarked by users. And of course learning-based 
semantic bookmarking of ad-hoc concepts can also nicely 
complement ontology-based approaches. 

The rest of the paper is organized as follows. In Section 2, 
we describe machine learning based techniques for creating 
and retrieving semantic bookmarks. Section 3 contains user 
evaluation based on the implementation of a prototype sys- 
tem. It measures the effectiveness of semantic bookmarking 
on reducing fatigue induced by browsing using handheld de- 
vices. Sections 4 and 5 contain related work and discussions 
respectively. 


2. LEARNING SEMANTIC BOOKMARKS 


Our approach to learning semantic bookmarks rests on 
two processes: (i) inferring the logical structure of the Web 
page via structural analysis of its content, and (ii) learning 
the salient features present in the content of the partitions 
in the logical structure to build a statistical model of the 
concept to be bookmarked. The learned statistical model is 
then used for retrieving instances of the bookmarked con- 
cept. In our earlier works in [29] and [28] respectively, we 
had proposed a structural analysis algorithm for partition- 
ing Web pages and a technique for learning features to anno- 
tate Web pages. In this paper we develop a computational 
framework for semantic bookmarking using handhelds by 
tightly integrating the techniques in these two works. We 
will briefly review the ideas underlying them in this Section. 

Figure 2: (a) DOM fragment of the New York Times


fragment 


2.1 Structural Analysis 


The essence of our partitioning idea is that consistency in 
presentation style and spatial locality of semantically related 
items in Web pages can be exploited to discover sequential 
patterns in the DOM structure of a page. We have used a 
simple typing system for nodes in the DOM tree to capture 
these sequential patterns. 

A primitive type encodes the presentation style (including 
visual cues such as font type and size) of a piece of text that 
corresponds to a leaf node in a DOM tree. The type of a 
leaf node is the sequence of HTML tags, with their attribute 
values, on the path from the root of the DOM tree to the 
node. For example, in the DOM tree of Figure 2(a) (which 
corresponds to the taxonomy and major headlines fragments 
of Figure 1(a)), all the leaf nodes corresponding to the main 
taxonomic items, “NEWS”, “OPINION”, “FEATURES”,
etc., have the same primitive type, tr-td-table-tr-td-img.
Let us denote this type as $T_1$. Further, observe that all the
subtaxonomic items, such as “International”, “National”,
etc., under each main taxonomic item, such as “NEWS”,
have the same primitive type, tr-td-table-tr-td-a-font.
Let us denote it using $T_2$.
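
The primitive typing described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the DOM is modeled as nested `(tag, children)` tuples with string leaves, attribute values are ignored (the paper includes them in the type), and the function name `primitive_types` and the toy tree are hypothetical.

```python
def primitive_types(node, path=()):
    """Yield (leaf_text, primitive_type) pairs for a toy DOM.

    A node is either a string (a text leaf) or a (tag, children) tuple.
    The primitive type of a leaf is the tag path from the root to it.
    Attribute values are ignored here, which is a simplification.
    """
    if isinstance(node, str):
        yield node, "-".join(path)
        return
    tag, children = node
    for child in children:
        yield from primitive_types(child, path + (tag,))


# Toy fragment loosely modeled on the taxonomy example (hypothetical structure).
dom = ("tr", [("td", [("table", [
    ("tr", [("td", [("img", ["NEWS"])])]),
    ("tr", [("td", [("a", [("font", ["International"])])])]),
    ("tr", [("td", [("a", [("font", ["National"])])])]),
])])])

for text, ptype in primitive_types(dom):
    print(text, ptype)
```

In this toy tree the main taxonomic item receives one type and both subtaxonomic items share another, which is the property the sequential pattern analysis relies on.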

A compound type summarizes the structural recurrence 
information at a subtree rooted at an internal node. Note 
that in Figure 2(a) the subtree rooted at the table node 
(shown in circle) groups together several main taxonomic 
items each of which is followed by a number of subtaxonomic 
items, i.e., the entire taxonomy is clustered under this single 
DOM tree node. This property of spatial locality combined 
with consistency in presentation style reveals structural recurrence
information about semantically related items. Observe
that the sequence of primitive types of the leaf nodes in
the subtree rooted at table is: $T_1 T_2 T_2 \ldots T_1 T_2 T_2 \ldots$. In this
string the sequential pattern $T_1 T_2^*$ (here $*$ denotes Kleene
closure) exactly captures the structural recurrence information
of each semantically related item (i.e., a main taxonomic
item followed by a number of subtaxonomic items).
Thus, the pattern $T_1 T_2^*$ becomes the compound type of this
table node.

Therefore, as illustrated by the example above, the idea
underlying structural analysis is to discover sequential patterns
on the typed sequence of nodes in a DOM tree. Given
any two types as defined above, their equivalence is defined
straightforwardly: two types are equivalent if and only if
they are syntactically the same. Our structural analysis
algorithm is built on the notion of a maximal repeating substring,
which is the smallest repeating substring with maximal
coverage in the original string.

Since semantically related items exhibit spatial locality, 
structural analysis can be performed recursively bottom-up 
starting from the leaf nodes of the DOM tree of an HTML
document. First, primitive types are assigned to all leaf
nodes. The type of an internal node with only one child
node is the same as that of the child. For internal nodes with
multiple children, the type of each child is first computed
and then the sequence of types belonging to all the children
is analyzed for patterns.

Analysis of type sequences for pattern detection is an iterative
process. In the first step, consecutive nodes having
equivalent types are collapsed into a single node. The intuition
behind this is that they all relate to the same item. We
denote this node as a group node. The type of this group
node is the same as that of any of its children. Next, the modified
sequence is analyzed for maximal repeating substrings. Every
sequence of consecutive nodes whose types match the maximal
repeat is collapsed under a single node. These nodes
are denoted as pattern nodes. The type of a pattern node
is the sequence of types in the repeat. This procedure of
grouping and pattern mining is repeated until no more patterns
can be detected. If the iterations do not terminate in
a single group node then the remaining non-pattern nodes
are merged with their preceding pattern nodes to create a
set of pattern nodes below a group node.

We illustrate pattern detection using an example type
sequence $T_1 T_2 T_3 T_2 T_3 T_4 T_1 T_2 T_3 T_5$. Observe that $T_2 T_3$ is a
maximal repeating substring. Let us use a new type $T_6$ to
denote the pattern $T_2 T_3$. Then after the first iteration, the
type sequence becomes $T_1 T_6 T_6 T_4 T_1 T_6 T_5$. The first two occurrences
of $T_6$ can be collapsed into a group node, resulting
in $T_1 T_6 T_4 T_1 T_6 T_5$, in which $T_1 T_6$ is a maximal repeating substring.
Again, we use a new type $T_7$ to represent the pattern
$T_1 T_6$. So after the second iteration the type sequence becomes
$T_7 T_4 T_7 T_5$. No more patterns can be detected and
the iterations stop. Finally, $T_4$ is merged with its preceding
pattern node $T_7$, and $T_5$ with its preceding $T_7$. $T_7$ is the
type assigned to the ultimate group node.
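
The grouping and pattern-mining loop can be sketched as follows. This is not the authors' implementation: it is a brute-force illustration that assumes "maximal coverage" means the largest number of positions covered by non-overlapping occurrences, breaks ties in favor of the shortest repeat, only considers repeats of length two or more, and omits the final step of merging leftover nodes into their preceding pattern nodes. All function names are hypothetical.

```python
from itertools import groupby


def collapse_runs(seq):
    """Grouping step: collapse consecutive equal types into one."""
    return [t for t, _ in groupby(seq)]


def count_nonoverlapping(seq, pat):
    """Count left-to-right non-overlapping occurrences of pat in seq."""
    k, i, count = len(pat), 0, 0
    while i + k <= len(seq):
        if tuple(seq[i:i + k]) == pat:
            count += 1
            i += k
        else:
            i += 1
    return count


def best_repeat(seq):
    """Return the repeat with maximal coverage (shortest on ties), or None."""
    best, best_key = None, None
    n = len(seq)
    for k in range(2, n // 2 + 1):
        seen = set()
        for i in range(n - k + 1):
            pat = tuple(seq[i:i + k])
            if pat in seen:
                continue
            seen.add(pat)
            c = count_nonoverlapping(seq, pat)
            if c >= 2:
                key = (c * k, -k)  # larger coverage first, then shorter pattern
                if best_key is None or key > best_key:
                    best, best_key = pat, key
    return best


def replace_pattern(seq, pat, name):
    """Replace every non-overlapping occurrence of pat by the new type name."""
    out, i, k = [], 0, len(pat)
    while i < len(seq):
        if tuple(seq[i:i + k]) == pat:
            out.append(name)
            i += k
        else:
            out.append(seq[i])
            i += 1
    return out


def detect_patterns(seq, next_id):
    """Iterate grouping and pattern mining until no repeat is left."""
    patterns = {}
    seq = collapse_runs(seq)
    while True:
        pat = best_repeat(seq)
        if pat is None:
            return seq, patterns
        name = f"T{next_id}"
        next_id += 1
        patterns[name] = pat
        seq = collapse_runs(replace_pattern(seq, pat, name))


example = ["T1", "T2", "T3", "T2", "T3", "T4", "T1", "T2", "T3", "T5"]
final_seq, found = detect_patterns(example, next_id=6)
print(final_seq, found)
```

On the example sequence this sketch introduces the same two pattern types as the walkthrough and stops at the same four-element sequence.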

Figure 2(b) shows the result of our partitioning technique 
on the DOM fragment of New York Times in Figure 2(a). 
Intuitively, a group node in the partition tree aggregates 
repeated occurrences of items that are semantically simi- 
lar while a pattern node encapsulates each such item. For 
instance, in Figure 2(b), the dotted circled group node ag- 
gregates occurrences of Major Headlines news concept. The 
pattern nodes below this group node correspond to every 
individual Major Headlines news item. 


2.2 Concept Model 


The statistical model of a concept is developed from features
learned from the set of partition tree nodes which
are labeled as its instances. Given any partition tree, feature
learning generates a set of features, with corresponding
weights, at every node in the tree. During training, the
probability of occurrence of a feature in a concept is computed
by a maximum likelihood approach based on simple
frequency counting with smoothing.
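
A minimal sketch of such a training step is given below. The paper does not name the smoothing scheme, so add-alpha (Laplace) smoothing is assumed here purely for illustration; the function name `train_concept` and its parameters are hypothetical.

```python
from collections import Counter


def train_concept(instances, vocabulary, alpha=1.0):
    """Estimate P(feature | concept) from labeled partition nodes.

    instances  : list of {feature: weight} dicts, one per labeled node
    vocabulary : iterable of all features known to the system
    alpha      : add-alpha smoothing constant (an assumption of this sketch)
    """
    counts = Counter()
    for feats in instances:
        counts.update(feats)
    vocab = set(vocabulary) | set(counts)
    total = sum(counts.values()) + alpha * len(vocab)
    return {f: (counts[f] + alpha) / total for f in vocab}


probs = train_concept(
    [{"link": 3, "text": 4}, {"link": 1}],
    vocabulary={"link", "text", "image"},
)
print(probs)
```

Smoothing keeps features that never occurred in the labeled instances (here `image`) from receiving zero probability.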

The content of a partition as well as the style with which
the content is presented are both utilized to learn unstructured
and structured features. However, our learning-based
framework is quite general and other kinds of features can
be accommodated in it.

Unstructured Features: After eliminating stop-words, the
bag of words in the partition tree constitutes the unstructured
elements in the feature space. Each feature element is
assigned a weight at every node in the partition tree.

At a leaf node $p_i$ of a partition tree, the weight of a feature
is the number of its occurrences in the text of $p_i$. The weight
of a feature at an internal partition tree node $p_j$ is the sum
of its weights from the immediate children nodes of $p_j$. In
this way, weights of features are propagated bottom-up the
tree.

However, sometimes it is necessary to utilize the partition
tree structure even further to assign higher weights to
more informative features. It is often the case that Web
page designers group together related content under certain
words (e.g., “BUSINESS” groups together the articles
“Dow Prunes ..”, “Oil Prices ..”, and “G.M., ..” in Figure 1(a)).
We should assign a relatively higher weight to
such words since they are in some sense the “constant” features
of the content. When constructing the partition tree
the non-constant items become children of a group node $p_j$
and the constant item $p_k$ becomes the sibling of $p_j$. Together
they appear as the children of a pattern node $p_i$
(see the illustration of this process in Figure 2(b) for taxonomy
news). Under these circumstances, the weights of features
in $p_k$ are multiplied by a factor equal to the number of children
in $p_j$. For instance, in the partition tree corresponding
to the page in Figure 1(a), the weight of the feature “BUSINESS”
will be multiplied by the number of children in its
sibling group node (3 in this case). Subsequently, bottom-up
aggregation is performed as described before.

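
The bottom-up weighting with the boost for "constant" words might be sketched as below. The dictionary-based tree, the tiny stop-word list, lowercasing, and the rule "in a pattern node with exactly one group child, every other child is boosted by the number of children of that group" are assumptions made for this illustration; the article texts are the truncated strings quoted above.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "of", "in", "and"}  # tiny illustrative list


def leaf(text):
    return {"kind": "leaf", "text": text, "children": []}


def node(kind, *children):
    return {"kind": kind, "children": list(children)}


def weights(n):
    """Return {word: weight} for a partition tree node, computed bottom-up."""
    if n["kind"] == "leaf":
        words = [w for w in n["text"].lower().split() if w not in STOP_WORDS]
        return Counter(words)
    groups = [c for c in n["children"] if c["kind"] == "group"]
    # Boost factor: number of children of the sibling group node, if any.
    boost = len(groups[0]["children"]) if n["kind"] == "pattern" and len(groups) == 1 else 1
    total = Counter()
    for c in n["children"]:
        factor = 1 if c["kind"] == "group" else boost
        for word, w in weights(c).items():
            total[word] += w * factor
    return total


business = node(
    "pattern",
    leaf("BUSINESS"),
    node("group", leaf("Dow Prunes"), leaf("Oil Prices"), leaf("G.M.")),
)
print(weights(business))
```

Here the heading word ends up with weight 3 while each article word keeps weight 1, matching the "BUSINESS" example in the text.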
Structured Features: Whereas unstructured features rep- 
resent important words that appear in the textual content 
of partitions, structured features capture the presentational 
aspects of their content. For instance, in Figure 1(a), each 
Major Headline news item is presented as a link (“Bush 
Aides..”), followed by two consecutive text strings (“By..”, 
“The commission..”). Some news items also include an optional
link (e.g. “Complete..” in news item 2). Abstractly
speaking the presentation style is captured by the sequence
link-text-text-?link, where ?link means that this link may
not always be present in all headline news items (akin to the
? operator used in the language of regular expressions).

The structured feature of a leaf node is either a link or 
text since leaf nodes in the partition tree contain either hyperlinks
or text strings. Hyperlink leaf nodes have only a
link feature with a weight of 1 while text leaf nodes have 
only a text feature with unit weight. We propagate the 
structured features of the leaf nodes up the tree to con- 
struct the structured features of the internal nodes and as- 
sign weights to them. The structured features of internal 
nodes are constructed thus: If an internal node is not a pat- 
tern node then its structured feature set is just the union 
of its children’s structured features. The weight of each 
feature in this set is the cumulative sum of the feature’s 
weight in each of the node’s children. Besides this unioned 
set of features, each pattern node also has an additional fea- 
ture which reflects the repetitive structure associated with 
it. The repetitive structure of a pattern node is captured 
in this additional feature by concatenating the structured 
features of the node’s children. Since we want to make a 
determination of concept instances using features that will 
always be present, features representing the optional aspect 
of the pattern are omitted. The synthesized concatenated 
feature is assigned a weight of 1 at the pattern node. 

For instance, in Figure 2(b), the leaf partitions “Bush 
Aides..”, “By PHILIP..”, and “The commission..” have 
structured features link, text, and text respectively. Simi- 
larly, the leaf partitions “Mix of..”, “By JEFFREY..”, “U.S. 
political..”, and “Complete..” have the features link, text, 
text, and link respectively. Structural analysis on the entire
sequence of major headlines, shown in Figure 1(a), yields the
set of structured features {(link-text-text, 1), (link, 1), (text, 2)}
for the first pattern node. Similarly, the second pattern node
has {(link-text-text, 1), (link, 2), (text, 2)} as its set of structured
features. Note that the link element denoting “Complete
..” is optional and hence is omitted from the concatenated
feature of the 2nd pattern node. Finally, the set of structured
features at the group node (considering these two pattern
nodes only) is {(link-text-text, 2), (link, 3), (text, 4)}.
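
The feature sets in this example can be reproduced with a short sketch. It assumes that each leaf is given as a `(kind, optional)` pair, i.e. that the optional elements of the pattern are already known from structural analysis; the function names are hypothetical.

```python
from collections import Counter


def pattern_features(children):
    """Structured features of a pattern node.

    children: list of (kind, optional) pairs, kind being "link" or "text".
    Every child contributes to the unioned counts; only the non-optional
    children contribute to the concatenated feature, which gets weight 1.
    """
    feats = Counter(kind for kind, _ in children)
    required = [kind for kind, optional in children if not optional]
    feats["-".join(required)] += 1
    return feats


def group_features(children_feats):
    """Structured features of a non-pattern node: summed child features."""
    total = Counter()
    for feats in children_feats:
        total.update(feats)
    return total


first = pattern_features([("link", False), ("text", False), ("text", False)])
second = pattern_features(
    [("link", False), ("text", False), ("text", False), ("link", True)]
)
group = group_features([first, second])
print(dict(first), dict(second), dict(group))
```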


2.3 Concept Detection 


The objective now is to use the learned model to identify 
concept instances in the partition tree of a new Web page. 
The likelihood of any node in this tree being an instance of 
a concept is computed using a multinomial distribution on 
the features at that node and probabilities of occurrences of 
features in the concept. However, to cope with false positives 
and ambiguities, we augment a simplistic likelihood-based 
approach with a two-step process to unambiguously identify 
concept instance nodes. In the first step, a set of candidate 
partition tree nodes for a concept is generated. In the second 
step, a bipartite graph based technique is used to produce a 
set of unambiguous (concept(c), node(n)) pairs. Each (c, n) 
pair means that the subtree rooted under the node n in the 
partition tree is an instance of the concept c. 
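
One way to compute the likelihood mentioned above is the multinomial log-probability of the feature weights at a node. The sketch below assumes integer weights and a small floor probability for features absent from the model; both choices, and the name `log_likelihood`, are assumptions rather than details given in the paper.

```python
import math


def log_likelihood(node_feats, probs, floor=1e-6):
    """Multinomial log-likelihood of a node's feature weights under a concept.

    node_feats : {feature: integer weight} at the partition tree node
    probs      : {feature: P(feature | concept)} from training
    floor      : probability used for features unseen in training (assumed)
    """
    n = sum(node_feats.values())
    ll = math.lgamma(n + 1)  # log n!
    for feature, weight in node_feats.items():
        ll += weight * math.log(probs.get(feature, floor))
        ll -= math.lgamma(weight + 1)  # log weight!
    return ll


probs = {"link": 0.5, "text": 0.5}
print(log_likelihood({"link": 1, "text": 1}, probs))
print(log_likelihood({"link": 1, "form": 1}, probs))
```

Working in log space avoids underflow for nodes with many features; a node containing features the concept never exhibited scores much lower.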

Candidate Generation: The aggregation of semantically 
related items by structural analysis results in the content of 

Table 1: (a) Major Headlines News Concept Questions (b) Category News Concept Questions


a subtree rooted at a partition tree node being: (i) “close” 
to the content in the subtrees rooted at its children, and (ii) 
“distant” from the content in the subtrees of its immediate 
sibling nodes. For instance, in Figure 2(b), the likelihood of
the dotted group node being an instance of the major headlines
concept is close to that of its children pattern nodes while being
distant from that of its sibling group node. To compute this we used
two thresholds, $t_{child}$ and $t_{nbr}$, to define the notions “close”
and “distant” respectively. A node is a candidate concept
instance if and only if its average likelihood deviation from
its siblings is greater than $t_{nbr}$ and its average likelihood
deviation from its children is less than $t_{child}$.
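
The candidate test can be written down directly. In this sketch "average likelihood deviation" is taken to be the mean absolute difference of (log-)likelihoods, and nodes without children or without siblings are simply rejected; both are assumptions, as are the parameter names `t_child` and `t_nbr`.

```python
def is_candidate(ll_node, ll_children, ll_siblings, t_child, t_nbr):
    """Decide whether a partition tree node is a candidate concept instance.

    ll_node     : likelihood of the node for the concept
    ll_children : likelihoods of its immediate children
    ll_siblings : likelihoods of its immediate siblings
    t_child     : upper bound on the average deviation from children ("close")
    t_nbr       : lower bound on the average deviation from siblings ("distant")
    """
    if not ll_children or not ll_siblings:
        return False  # assumption: such nodes are not considered
    dev_children = sum(abs(ll_node - c) for c in ll_children) / len(ll_children)
    dev_siblings = sum(abs(ll_node - s) for s in ll_siblings) / len(ll_siblings)
    return dev_siblings > t_nbr and dev_children < t_child


print(is_candidate(-10.0, [-11.0, -9.0], [-40.0], t_child=2.0, t_nbr=5.0))
print(is_candidate(-10.0, [-30.0], [-12.0], t_child=2.0, t_nbr=5.0))
```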

Ambiguity Resolution: Since the same node can be a
candidate for different concepts, ambiguities can arise. We
represent the association between concepts and candidate
nodes as a bipartite graph: the set of concepts $C$ and the
set of candidate nodes $P$ are the two disjoint sets of vertices
in the graph. An edge between $c_i \in C$ and $p_k \in P$ is created
if $p_k \in Candidate(c_i)$. The idea behind bipartite graph-based
ambiguity resolution is as follows: First we form the
set $S_i$ for every concept $c_i$. $S_i$ consists of nodes that only
match $c_i$. Now pick the node $p_k$ in $S_i$ with the maximum
likelihood value to unambiguously represent an instance of
the concept $c_i$. We remove all the other edges from $c_i$ to
any $p_l$, $l \neq k$, from the graph. This computation is repeated
until it is not possible to derive any more 1-1 associations
between concepts and partition nodes.
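
A minimal sketch of this iterative resolution is shown below. It reads the description literally: in each round the exclusive sets are formed for all unresolved concepts against the current graph, and then the chosen edges are fixed. The dictionary representation and the name `resolve_ambiguity` are assumptions of this illustration.

```python
def resolve_ambiguity(candidates):
    """Derive 1-1 (concept, node) associations from a bipartite graph.

    candidates: {concept: {node: likelihood}}, one edge per candidate node.
    Returns {concept: node} for every concept that could be resolved.
    """
    edges = {c: dict(nodes) for c, nodes in candidates.items()}
    assigned = {}
    while True:
        picks = {}
        for c, nodes in edges.items():
            if c in assigned:
                continue
            # Nodes that currently match this concept and no other.
            exclusive = [
                n for n in nodes
                if all(n not in edges[other] for other in edges if other != c)
            ]
            if exclusive:
                picks[c] = max(exclusive, key=nodes.get)
        if not picks:
            return assigned
        for c, n in picks.items():
            assigned[c] = n
            edges[c] = {n: edges[c][n]}  # drop all other edges of c


result = resolve_ambiguity({
    "c1": {"a": 0.9, "b": 0.5},
    "c2": {"b": 0.8},
})
print(result)
```

In this toy graph node `b` is ambiguous at first; once `c1` is tied to `a` and its edge to `b` is removed, `b` matches only `c2` and is resolved in the next round.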


3. EXPERIMENTAL RESULTS 


Experimental Setup: We implemented a prototype se- 
mantic bookmarking system based on the integration of logi- 
cal structure of Web pages with feature learning as described 
in the previous section. This prototype system was executed 
in a desktop environment where semantic bookmarks were 
learned from training Web pages and were subsequently used 
to retrieve concept instances from a collection of test Web
pages. Each Web page, training or test, was transformed 
into a partition tree and features extracted from every node 
in the tree. Instances of concepts, which were to be book- 
marked, were manually identified in the training pages and 
their corresponding nodes in the partition trees accordingly 
labeled. The features in these labeled partition nodes were 
used to learn the concept models. The learned models were 
applied on the partition trees of the training pages and like- 
lihood values computed for every concept at every node. 
The $t_{nbr}$ and $t_{child}$ thresholds for a concept were determined
by analyzing its likelihood values at immediate siblings and
children nodes, respectively, of its labeled partition nodes.
Finally, the trained models and the computed thresholds 
were applied on the partition trees of test Web pages and 
concept instance nodes identified. 

In our current prototype system, identification of concept
instances in training pages is performed by manually exploring
the corresponding partition trees. Unlike DOM trees,
the partition tree of a Web page produced as a result of
structural analysis is quite shallow. Thus, it is not very difficult
for a user to navigate to the node in the logical structure
whose subtree corresponds to the concept instance of
interest.³

The objective of our experiments was to compare semantic
bookmarking against normal browsing for focused content
retrieval in handheld devices. To this end, we have
concentrated on a quantitative assessment of our semantic
bookmarking technique. We measured two metrics: the time
and the I/O gestures (pen taps) users need to complete a set
of focused browsing tasks with and without semantic bookmarking.
These metrics were measured in a PocketPC emulator
which simulates a handheld browsing environment. Instead
of a pen or a navigation button, users perform the vertical
and horizontal shift operations in the emulator browser
with mouse clicks on the emulator button. Figure 1(c) shows
such an emulator.


³ Incorporating a user-friendly interface for giving training
examples is a work in progress.


Figure 3: Time taken, with and without Semantic Bookmarks, for answering the questions in (a) Major
Headlines News Concept, and (b) Category News Concept
[Bar charts over tasks M1 to M9 in (a) and C1 to C9 in (b); legend: No Bk., Bk.]

The test Web pages used in the desktop experiment were 
loaded up into the emulator environment. In addition, the
content of the concept instances identified by our learning
algorithm was converted into HTML. Images present in the
original Web page were preserved in the HTML conversion 
while scripts were removed. This HTML conversion corre- 
sponds to retrieving the semantic bookmark and rendering 
it on the handheld’s Web browser. 

Both the test Web page as well as the bookmarked content
extracted from the test page were loaded into the PocketPC
emulator. Evaluations were conducted on these loaded
pages.

Subjects, Domains, and Tasks: We used 10 subjects as 
evaluators. The subjects were chosen based on their famil- 
iarity with handheld devices. Each of them had used at least 
one handheld device, usually a cell phone, for over a year. 
All the subjects were computer science graduate students 
who were comfortable with our test setup. 

We selected the news domain and the travel domain for 
evaluation. These two domains possess dynamic content 
and are also quite popular among Web users. Prior to the 
experiment, the subjects were made familiar with the layout 
of the content in the pages chosen in the two domains. This 
conforms to the notion that users bookmark content
from familiar and frequently visited pages. 

Subjects were given a questionnaire and their task was 
to answer it with respect to the information content in the test page and
the bookmarked content loaded in the handheld. The tasks 
were divided into three categories with increasing levels of 
difficulty: 


• Answering questions from single Web pages.


• Answering questions that require comparing information
from a set of Web pages.


• Answering questions that require exhaustively reading
the retrieved bookmark from all of the Web pages.


The motivation behind this gradation of tasks was to eval- 
uate the effectiveness of semantic bookmarking for compre- 
hending information not just from a single page but from a 
collection of pages in the same domain. 

We used the front pages of 8 news portals as the test set for
our experiments on the news domain. In each of these pages,
we identified two semantic concepts, Major Headlines News
and Category News. The content in these concept instances
is very dynamic in nature and as such is suitable to be
bookmarked. Two front pages, one each from New York
Times and CNN, were used for training purposes*. Table 1
shows the tasks for the concepts in the news domain. The 
first column in each concept’s table corresponds to the task 
number, while the second column is a news site, and the 
third column is the question which has to be answered from 
the front page of that site in the test set. The first 8 tasks 
for both the news concepts are single page questions, while 
question 9 compares four Web pages, and the last question 
is exhaustive in nature. 

The front pages of Expedia, Priceline, and Orbitz were 
used for evaluation in the travel domain. The semantic con- 
cept of Travel Deals, which shared the dynamic content na- 
ture of the news concepts, was used for bookmarking. An 
Expedia front page was used for training this concept. Ta- 
ble 3(a) shows the tasks in the travel domain for this con- 
cept. Questions D1, D2, and D3 are single page questions, 
while D4 is across pages, and answering D5 requires exhaus- 
tive enumeration of all the deals in all the three pages. 

Each subject was required to answer all the 20 questions 
from the news domain as well as all the 5 questions from 
the travel domain. In order to smooth the effect of the or- 
der of experimentation, each of 5 randomly chosen subjects 
answered the questions first with and then without semantic 
bookmarking. The remaining 5 subjects carried out the ex- 
periments in the reverse order. Moreover, for each subject, 
a time gap of 7 days was observed between answering the 
first and second sets of questions. Since we did not discern 
any noticeable difference between the two groups of subjects, 
i.e. those who evaluated first with semantic bookmarks and 
those who evaluated first without semantic bookmarks, the 
results shown in the following subsections are averaged over 
all the 10 users. 

Results on Time: Figures 3(a) and (b) show the time 
taken, averaged over all the 10 subjects, to accomplish the 
first nine tasks in the Major Headlines News and Category 
News concepts respectively. In both the figures, the shaded 

Figure 4: Number of Pen Taps required, with and without Semantic Bookmarks, for answering the questions
in (a) Major Headlines News Concept, and (b) Category News Concept 


bars correspond to time taken without semantic bookmarking
while the checkered bars correspond to time with semantic
bookmarking of the corresponding concept. The numbers
do not include the time taken to load up the pages in the emulator
browser since we were concerned only with comparing
the information comprehension times between the two approaches.
For the same reason, the numbers also exclude
the (insignificant) time required to compute the semantic
bookmark.

Observe the significant decrease in time with the use of 
semantic bookmarking for both the concepts. For the Major 
Headlines News concept this decrease ranges from 84.36% 
in M5 to 2.2% in M4 with an average decrease of 47.37% 
over the first eight tasks. In the Category News concept this 
decrease ranges from 89.22% in C4 to 6.25% in C7 with an 
average decrease of 46.53% over C1 to C8. For the cross
page questions, M9 and C9, there are decreases of 69.80% 
and 67.77% in time respectively. The decrease in times, for 
both the concepts, varies between sites due to the difference 
in layout styles among them. Thus, while the layout of ma- 
jor headlines news in Financial Times (M4) facilitates easy 
browsing even without semantic bookmarking, the complex 
layout of the Houston Chronicle major headlines news (M5) 
provides evidence of the usefulness of semantic bookmark- 
ing. For most of the tasks in Figures 3(a) and (b), the Cat- 
egory News concept times are less than the corresponding 
times in Major Headlines News. This is due to the organiza- 
tion of category news into subcategories which makes infor- 
mation access easier. The time portions in Table 2 show the 
effect of semantic bookmarking for the exhaustive questions 
M10 and C10. Averaged over all the eight sites, the de- 
creases in time are 50.05% and 41.02% for Major Headlines 
News and Category News respectively. 
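
The percentage decreases quoted above can be reproduced from raw measurements along the following lines. This is a hedged sketch: the function names are hypothetical, the measurements in the usage example are made-up placeholders rather than data from the study, and it assumes that the reported average is the mean of the per-task percentage decreases, which the text does not state explicitly.

```python
def percent_decrease(without, with_bookmark):
    """Percentage decrease of a measurement (time or pen taps) when bookmarks are used."""
    if without <= 0:
        raise ValueError("baseline measurement must be positive")
    return 100.0 * (without - with_bookmark) / without


def summarize(tasks):
    """Return per-task percentage decreases and their mean.

    tasks maps a task id to a (without, with) pair of measurements.
    Assumption: the average is taken over the per-task percentages.
    """
    per_task = {task: percent_decrease(w, b) for task, (w, b) in tasks.items()}
    average = sum(per_task.values()) / len(per_task)
    return per_task, average


if __name__ == "__main__":
    # Placeholder measurements (seconds), NOT values from the paper.
    example = {"T1": (100.0, 50.0), "T2": (80.0, 20.0)}
    decreases, avg = summarize(example)
    print(decreases, avg)
```

The same computation applies unchanged to pen-tap counts in place of times.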

Similar decreases in time are also observed for the tasks 
related to the Travel Deals concept in the travel domain 
as shown in Figure 5 and Table 3(b) (time portions). The
average decrease in time over D1, D2, and D3 (84.5%) is
larger than in the news domain because of the very
complex layout of information, with forms and search boxes,
in travel front pages.

Results on I/O: Figures 4(a) and (b) show the decrease in 
I/O gestures, i.e. pen taps, averaged over all the 10 subjects 
with the use of semantic bookmarking in the news domain. 
For Major Headlines News, this decrease ranges from 94.32% 
in M5 to 9.52% in M7 with an average decrease of 63.11% 
over the first eight tasks. Similarly, for Category News the
decrease ranges from 92% in C4 to 22.53% in C7 with an 
average decrease of 62.78% over C1 to C8. The cross page 
questions, M9 and C9, have decreases of 77.34% and 74.94% 
respectively. Table 2 shows the decrease in pen taps for the 
exhaustive questions M10 and C10. Averaged over all the 
eight pages, there are decreases of 65.86% and 57.78% for 
M10 and C10 respectively. 

The average decrease in pen taps for the Travel Deals concept,
as shown in Figure 5, over D1, D2, and D3 is around
91.87%. Similar decreases in pen taps are also observed for
the cross page question D4 and the exhaustive question D5,
as shown in Figure 5 and Table 3(b) respectively.

Results on Bandwidth: In a mobile handheld environment,
the bandwidth of the wireless network poses constraints
on the amount of data that can be transmitted.
Table 4 summarizes our findings on the bandwidth savings 
which could be accomplished by the use of semantic book- 
marks. The first column in Table 4 indicates the front page 
of the Web site, the second column shows the total num- 
ber of bytes including images, scripts, and plain HTML for 
that page, the third column (C3) shows the total number 
of bytes without scripts, and the fourth column (C4) shows 
the total number of bytes without images and scripts. The 
first column (C5) in each news concept shows the number of
bytes, including images but excluding scripts, for that concept
instance in the corresponding Web page. The second
column in each news concept shows the percentage reduction of
C5 over C3, while the third column shows the percentage reduction
of C5 over C4. Observe the significant reduction in bandwidth
in most of the pages and across both the concepts,
even when semantic bookmarks with images are compared to
the original Web pages without images. This indicates the utility
of semantic bookmarking, from a hardware perspective, for
focused repetitive browsing activities.
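
To make the column comparisons in Table 4 concrete, the following sketch computes the two reductions for a single page. The class and field names are hypothetical, the byte counts in the usage example are placeholders rather than measurements from the paper, and the formula (baseline minus concept bytes, divided by baseline) is our reading of "reduction of C5 over C3" and "reduction of C5 over C4".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageBytes:
    """Byte counts for one front page (hypothetical field names)."""
    site: str
    c3_no_scripts: int             # whole page, scripts removed
    c4_no_images_no_scripts: int   # whole page, images and scripts removed
    c5_concept: int                # bookmarked concept, images kept, scripts removed


def reduction(baseline, concept):
    """Percentage of baseline bytes saved by sending only the concept bytes."""
    if baseline <= 0:
        raise ValueError("baseline byte count must be positive")
    return 100.0 * (baseline - concept) / baseline


def bandwidth_savings(page):
    """Return (reduction of C5 over C3, reduction of C5 over C4) in percent."""
    return (
        reduction(page.c3_no_scripts, page.c5_concept),
        reduction(page.c4_no_images_no_scripts, page.c5_concept),
    )


if __name__ == "__main__":
    # Placeholder byte counts, NOT values from Table 4.
    page = PageBytes("example-news-site", 200_000, 50_000, 20_000)
    print(bandwidth_savings(page))
```

Under this reading, the second value becomes negative whenever the concept with its images is larger than the image-free page, which would be consistent with the savings being reported for most, rather than all, of the pages.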


4. RELATED WORK 


The problem of creating, retrieving and evaluating the 
effectiveness of personalized semantic bookmarks for hand- 
helds is a relatively new topic in the literature. The areas 
closely related to this work include content adaptation for 
small-screen devices, wrappers for data extraction, and the 
Semantic Web. 

Table 2: Exhaustive Question (M10 and C10) for News Domain Concepts

Table 3: (a) Travel Deals Concept Questions (b) Pen Taps and Time required, with and without Semantic
Bookmarks, for answering Question D5

D1  Is there a deal to Florida?
D2  Is there a deal to Florida?
D3  Is there a deal to Florida?
D4  What is the cheapest deal to Florida from Expedia, Orbitz, and Priceline?
D5  How many deals are there?

Initial efforts at adapting Web content onto handhelds relied
on WML (Wireless Markup Language) and WAP (Wireless
Application Protocol) for designing and displaying Web
pages [23, 5, 21]. Because these approaches impose an additional
burden on Web page authors, who must create separate WML content,
subsequent work turned to automatic adaptation of normal Web
content onto small screen devices (see [6, 9, 8, 7, 35, 12,
38, 20, 10, 2]). These works have focused on organizing the
Web page into tree structures and summarizing its content. 
While they are effective for ad-hoc exploratory browsing, 
summary structures cause needless navigational steps when 
a user is only interested in targeted content. Our technique 
only presents the desired information and our evaluation re- 
sults indicate that it mitigates browsing fatigue caused by 
needless navigation. 

Recall that our technique partitions Web pages into semantically
related units prior to building the statistical model.
Web page partitioning techniques have been proposed for
adapting content onto small screen devices [6, 9, 8, 7, 12, 38, 
4, 25]. Related partitioning techniques have also been pro- 
posed for other applications like content caching [32], Web 
page cleaning and data mining [36, 37, 3], Web search [39], 
schema extraction [13], and displaying content in a browser 
[26]. Unlike our approach, these works do not associate con- 
tent semantics with consistency of presentation style and 
spatial locality — the key to inferring the logical structure of 
a page organized around its content semantics. Semantically 
related items are more accurately identified and aggregated 
together at various levels of granularity by content analysis 
based on this idea. Learning salient features of partitions
constituting such aggregated items enables users to create
and retrieve succinct semantic bookmarks which precisely
correspond to the desired content. The idea of learning features
of Web page segments was recently explored in [33].
Apart from the difference in the application scenario (data
cleaning in [33] vs. our semantic bookmarking), their learning
setting does not utilize the presentational aspects of the
content. The fundamental difference between our work
and all the above works, however, is that we tightly integrate the
logical structure of Web pages with feature learning. It is this
tight coupling that facilitates identification of the more
distinguishing characteristics of concepts, thereby leading to the
creation and retrieval of semantic bookmarks with a high
degree of precision.

Figure 5: Pen Taps and Time required, with and without Semantic Bookmarks, for answering Questions D1 to D4

Table 4: Bandwidth Savings from Semantic Bookmarks in the News Domain Pages


Semantic bookmarking is also related to the extensive re- 
search on manually or semi-automatically constructing wrap- 
pers for data extraction from Web pages (see [24] for a sur- 
vey on wrappers). However, being syntax-based, wrappers 
are sensitive to structural changes in the Web page. In ad- 
dition, they are page-specific. Recent approaches to auto- 
mated wrapper construction also rely on syntax-based solu- 
tions [1, 15, 11] (such as assuming a common schema or using 
specific tags as record boundary separators). In contrast, se- 
mantic bookmarking is resilient to structural changes. As 
long as the features associated with the bookmarked con- 
cept are sufficiently preserved in a Web page, the content 
corresponding to the concept instance in the page can be 
retrieved. Moreover, the scope of semantic bookmarking 
extends to pages drawn from different Web sites that share 
a common application domain. The notion of using “seman- 
tics” for wrapper learning was only very recently discussed 
in [34]. However, their use of semantics is limited to simple 
words and does not make use of presentational aspects of 
content. Moreover, unlike ours, the work in [34] does not
infer the logical structures of Web pages.

The Semantic Web has spurred research on making Web 
pages machine understandable. To realize the Semantic 
Web one has to annotate Web pages with semantic meta- 
information. Powerful ontology management systems and 
knowledge bases have been used for interactive annotation 
of web pages [19, 22, 18] or have been combined with lin- 
guistic analysis for fully automated approaches [16, 30, 14, 
29]. While ontologies and knowledge bases can be used for
semantic bookmarking via Semantic Web browsers [17, 31],
they restrict the user to only those concepts defined
in them. In contrast, use of machine learning facilitates creation
of personalized ad-hoc semantic bookmarks. Such a
degree of personalization not only gives users the flexibility 
to define their own view of semantic concepts but also pro- 
vides them with a transparent workaround when a desired 
concept does not exist in the knowledge base. 

Finally, partitioning documents into distinct segments is 
related to work on topic detection [40]. However, in contrast
to typical topic detection work on unstructured text,
our techniques analyze semi-structured HTML documents
and exploit their additional structural information.


5. DISCUSSIONS


In this paper, we have reported on a preliminary quantita- 
tive evaluation of learning-based semantic bookmarking on 
handheld devices. While further statistical analysis of the 
data is required, we believe it is also important to measure 
the qualitative impact of the technology on users. In partic- 
ular, it would be interesting to assess user response to the 
loss of surrounding context versus the browsing efficiency 
gained by focused content delivery. 

From an experimental perspective, it is worthwhile eval- 
uating the effectiveness of semantic bookmarking on actual 
handhelds in a real-world wireless setting. Such a setting can 
give rise to additional usability issues that may not mani- 
fest themselves in an emulator environment. Implementa- 
tion of semantic bookmarking in such a real environment is 
in progress. 

Acknowledgments: This work was supported in part 
by the NSF grants CCR-0205376 and CCR-0311512. 